Project Part 4: Merging & Querying All South Korean Air Pollution Data¶

Author: Robert Zacchigna¶

Table of Contents¶

Collected Datasets - Seoul (Districts), South Korea Air Pollution
Imports
Classes
Definitions

Loading AirData

Plotting AirData

Collected Datasets - Seoul (Districts), South Korea Air Pollution¶

Collected Datasets Location: Reformed_Data (Folder)

CSV: CSV-Air_Pollution_Data-(Reformed_and_AQI_Values).csv

Measurement date – Date and time of air pollution measurement
Station code – Station identification code
Address – Street address of air pollution measurement station
Latitude – Latitude of air pollution measurement station
Longitude – Longitude of air pollution measurement station
SO2 – Sulfur Dioxide Measurement (ppm)
NO2 – Nitrogen Dioxide Measurement (ppm)
O3 – Ozone Measurement (ppm)
CO – Cobalt Measurement (ppm)
PM10 – Size 10 Micrometer Diameter Particulate Matter (PM) Measurement (Mircrogram/m3)
PM2.5 – Sized 2.5 Micrometer Diameter Particulate Matter (PM) Measurement (Mircrogram/m3)
AQI_(PM2.5) – Air Quality Index (AQI) value of size 2.5 micrometer diameter particulate matter (PM)
AQI_Risk_Level – AQI risk level as outlined by the EPA (Environmental Protection Agency) of the USA.

Website: Website-Air_Pollution_Data-(Reformed_and_AQI_Values).csv

Measurement_Date – Date and time of air pollution measurement
Country – Country of Origin of the air pollution reading (South Korea)
City – City of Origin of the Air Pollution Reading (Seoul)
District – District of origin of the air pollution reading (Seoul Districts)
PM2.5 – Sized 2.5 Micrometer Diameter Particulate Matter (PM) Measurement (Mircrogram/m3)
AQI_(PM2.5) – Air Quality Index (AQI) value of size 2.5 micrometer diameter particulate matter (PM)
AQI_Risk_Level – AQI risk level as outlined by the EPA (Environmental Protection Agency) of the USA.

API: API-Air_Pollution_Data-(Reformed_and_AQI_Values).csv

Measurement_Date – Date and time of air pollution measurement
Country – Country of Origin of the air pollution reading (South Korea)
City – City of Origin of the Air Pollution Reading (Seoul)
District – District of origin of the air pollution reading (Seoul Districts)
Latitude – Latitude of air pollution measurement station
Longitude – Longitude of air pollution measurement station
PM10 – Size 10 Micrometer Diameter Particulate Matter (PM) Measurement (Mircrogram/m3)
PM2.5 – Sized 2.5 Micrometer Diameter Particulate Matter (PM) Measurement (Mircrogram/m3)
AQI_(PM2.5) – Air Quality Index (AQI) value of size 2.5 micrometer diameter particulate matter (PM)
AQI_Risk_Level – AQI risk level as outlined by the EPA (Environmental Protection Agency) of the USA.

Imports¶

import sqlite3
import numpy as np
import pandas as pd
import seaborn as sb
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.colors as colors

from scipy import stats

Classes¶

This class is for changing the midpoint on the color bar for graphs, see definition CSVandAPI_GradientPlot(c)

# Set the Colormap and Center on the Colorbar
class MidpointNormalize(colors.Normalize):
    """
    Normalize the colorbar so that diverging bars work there
    way from either side from a prescribed midpoint value

    e.g. im=ax1.imshow(array, norm=MidpointNormalize(midpoint=0.,vmin=-100, vmax=100))
    """
    
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        colors.Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        
        return np.ma.masked_array(np.interp(value, x, y), np.isnan(value))

Definitions¶

This function is for initially creating the database tables to load in the CSV data that has been collected

def createSQL_Tables(c):
    # CSV Table
    c.execute('''CREATE TABLE IF NOT EXISTS CSV_Data (
                    Measurement_Date DATE, 
                    Measurement_Time TIME, 
                    Station_Code INTEGER, 
                    Country VARCHAR2,
                    City VARCHAR2, 
                    District VARCHAR2, 
                    Street_Address VARCHAR2, 
                    Latitude INTEGER, 
                    Longitude INTEGER, 
                    SO2 INTEGER,
                    NO2 INTEGER, 
                    O3 INTEGER, 
                    CO INTEGER, 
                    PM10 INTEGER, 
                    PM25 INTEGER, 
                    AQI_PM25 INTEGER, 
                    AQI_Risk_Level VARCHAR2
            )''')

    # Website Table
    c.execute('''CREATE TABLE IF NOT EXISTS Website_Data (
                    Measurement_Date DATE, 
                    Country VARCHAR2, 
                    City VARCHAR2,
                    District VARCHAR2, 
                    PM25 INTEGER, 
                    AQI_PM25 INTEGER, 
                    AQI_Risk_Level VARCHAR2
            )''')

    # API Table
    c.execute('''CREATE TABLE IF NOT EXISTS API_Data (
                    Measurement_Date DATE, 
                    Country VARCHAR2, 
                    City VARCHAR2, 
                    District VARCHAR2, 
                    Latitude INTEGER, 
                    Longitude INTEGER, 
                    PM10 INTEGER, 
                    PM25 INTEGER, 
                    AQI_PM25 INTEGER, 
                    AQI_Risk_Level VARCHAR2
            )''')

    conn.commit()

This function is for loading in all the data from the CSV files into the database tables.

def loadData(c):
    # Read in AirData
    airData_CSV = pd.read_csv('Reformed_Data/CSV-Air_Pollution_Data-(Reformed_and_AQI_Values).csv')
    airData_Website = pd.read_csv('Reformed_Data/Website-Air_Pollution_Data-(Reformed_and_AQI_Values).csv')
    airData_API = pd.read_csv('Reformed_Data/API-Air_Pollution_Data-(Reformed_and_AQI_Values).csv')
    
    # Load data into SQL tables
    airData_CSV.to_sql('CSV_Data', conn, if_exists='replace', index=False)
    airData_Website.to_sql('Website_Data', conn, if_exists='replace', index=False)
    airData_API.to_sql('API_Data', conn, if_exists='replace', index=False)

    print('CSV Data\n')
    for row in c.execute('SELECT * FROM CSV_Data limit 5'):
        print(row)

    print('\n\nWebsite Data\n')
    for row in c.execute('SELECT * FROM Website_Data limit 5'):
        print(row)

    print('\n\nAPI Data\n')
    for row in c.execute('SELECT * FROM API_Data limit 5'):
        print(row)

This functions plots the histograms, with a normalization curve, of each of the data tables AQI values

def tableDistPlots(c):
    tableList = []

    for row in c.execute("SELECT name FROM sqlite_master WHERE type='table'"):
        tableList += [row[0]]

    for table in tableList:
        c.execute("SELECT [AQI_(PM2.5)] FROM " + table)

        AQI_Values = []

        for row in c.fetchall():
            AQI_Values += [row[0]]
        
        fig = plt.figure()
        fig.set_size_inches(16, 10)

        plot = sb.distplot(AQI_Values, fit=stats.norm, kde=False, fit_kws={"label": "Normalization Curve"})
        plot.set(title=table + ' Table, AQI Values Histogram', xlabel='AQI Values')

        plt.legend()

This function plots a scatter plot of the AQI values by district using the CSV, Website and API database tables

def CSV_Website_API_ScatterPlot(c):
    districtList = []
    AQI_Values = []

    # CSV, Website and API Data
    for row in c.execute('''SELECT District, [AQI_(PM2.5)] FROM CSV_Data 
                            UNION SELECT District, [AQI_(PM2.5)] FROM Website_Data 
                            UNION SELECT District, [AQI_(PM2.5)] FROM API_Data'''):
        
        districtList += [row[0]]
        AQI_Values += [row[1]]

    fig = plt.figure()
    fig.set_size_inches(16, 10)

    plot = sb.stripplot(x=districtList, y=AQI_Values, alpha=0.7, jitter=True)

    plot.set_xticklabels(labels=plot.get_xticklabels(), rotation=65)
    plot.set(title='AQI Value by Seoul District', xlabel='District', ylabel='AQI Value')

This function plots a gradient plot of the AQI values by District and Date using the CSV and API database tables

def CSVandAPI_GradientPlot(c):
    dateList = []
    districtList = []
    AQI_Values = []

    # CSV and API Data
    for row in c.execute('''SELECT Measurement_Date, District, [AQI_(PM2.5)] FROM CSV_Data 
                            UNION SELECT Measurement_Date, District, [AQI_(PM2.5)] FROM API_Data 
                                ORDER BY Measurement_Date'''):
        dateList += [row[0]]
        districtList += [row[1]]
        AQI_Values += [row[2]]

    dateList = [dt.datetime.strptime(date, '%m/%d/%Y').date() for date in dateList]
    dateList.sort()

    fig = plt.figure()
    fig.set_size_inches(16, 10)
    
    norm = MidpointNormalize(midpoint=75)
    
    plot = plt.scatter(x=dateList, y=districtList, c=AQI_Values, alpha=0.5, cmap='RdYlGn_r', norm=norm)
    
    cbar = plt.colorbar(plot)
    cbar.ax.set_title('AQI Scale')
    
    plt.title('Seoul District AQI Values by Date')
    plt.xlabel('Date')
    plt.ylabel('Districts')

Loading AirData¶

Load AirData Dataframes (CSV, Website, and API Data) into SQLite Database¶

Connecting to the database

# Creates SQlite3 database in the same location as this notebook
conn = sqlite3.connect('AirDatabase.db')
c = conn.cursor()

Creating Tables in AirDatabase.db¶

createSQL_Tables(c)

Load Data from Collected CSV Data Files into AirDatabase.db and Outputting First Few Rows of Each Table to Show its Structure¶

loadData(c)

CSV Data

('1/1/2017', '0:00', 101, 'Republic of Korea', 'Seoul', 'Jongno-gu', '19 Jong-ro 35ga-gil', 37.5720164, 127.0050075, 0.004, 0.059, 0.002, 1.2, 73, 57, 152, 'Unhealthy')
('1/1/2017', '1:00', 101, 'Republic of Korea', 'Seoul', 'Jongno-gu', '19 Jong-ro 35ga-gil', 37.5720164, 127.0050075, 0.004, 0.058, 0.002, 1.2, 71, 59, 153, 'Unhealthy')
('1/1/2017', '2:00', 101, 'Republic of Korea', 'Seoul', 'Jongno-gu', '19 Jong-ro 35ga-gil', 37.5720164, 127.0050075, 0.004, 0.056, 0.002, 1.2, 70, 59, 153, 'Unhealthy')
('1/1/2017', '3:00', 101, 'Republic of Korea', 'Seoul', 'Jongno-gu', '19 Jong-ro 35ga-gil', 37.5720164, 127.0050075, 0.004, 0.056, 0.002, 1.2, 70, 58, 152, 'Unhealthy')
('1/1/2017', '4:00', 101, 'Republic of Korea', 'Seoul', 'Jongno-gu', '19 Jong-ro 35ga-gil', 37.5720164, 127.0050075, 0.003, 0.051, 0.002, 1.2, 69, 61, 154, 'Unhealthy')


Website Data

('1/2017', 'Republic of Korea', 'Seoul', 'Jung-gu', 34, 97, 'Moderate')
('2/2017', 'Republic of Korea', 'Seoul', 'Jung-gu', 29, 87, 'Moderate')
('3/2017', 'Republic of Korea', 'Seoul', 'Jung-gu', 38, 107, 'Unhealthy for Sensitive Groups')
('4/2017', 'Republic of Korea', 'Seoul', 'Jung-gu', 26, 80, 'Moderate')
('5/2017', 'Republic of Korea', 'Seoul', 'Jung-gu', 22, 72, 'Moderate')


API Data

('7/15/2020', 'Republic of Korea', 'Seoul', 'Jongno-gu', 37.572025, 127.005028, 15, 40, 112, 'Unhealthy for Sensitive Groups')
('7/16/2020', 'Republic of Korea', 'Seoul', 'Jongno-gu', 37.572025, 127.005028, 24, 75, 161, 'Unhealthy')
('7/17/2020', 'Republic of Korea', 'Seoul', 'Jongno-gu', 37.572025, 127.005028, 24, 77, 162, 'Unhealthy')
('7/18/2020', 'Republic of Korea', 'Seoul', 'Jongno-gu', 37.572025, 127.005028, 26, 84, 166, 'Unhealthy')
('7/19/2020', 'Republic of Korea', 'Seoul', 'Jongno-gu', 37.572025, 127.005028, 27, 81, 164, 'Unhealthy')

Plotting AirData¶

Table of AQI Histograms with Normalization Curve¶

Plotting the histograms of each of the data tables AQI Values

tableDistPlots(c)

Scatterplot of AQI Values by District using the CSV, Website and API Data Tables¶

This plot shows the overall distribution of AQI values by Seoul District from all three data tables, the higher the AQI Value the worse the air pollution is.

# Higher AQI Value = Bad
CSV_Website_API_ScatterPlot(c)

Gradient Plot of AQI Values by District and Date using the CSV and API Data Tables¶

This plot shows the distribution of AQI values by Seoul District and Date from the CSV and API data tables, using a color gradient scale (based on the AQI value thresholds from the EPA) to show how dangerous that particular AQI value is: Green is Good and Red is Bad.

The gap in the graph is the gap in time between the CSV and API's data collection because the API data was collected far more recently that than the collected CSV data.

# Green Good, Red Bad
CSVandAPI_GradientPlot(c)